Design of a lean interface for Sanskrit corpus annotation
Abstract
We describe an innovative computer interface designed to assist annotators in efficiently selecting segmentation solutions for the proper tagging of a Sanskrit corpus. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning on the input sentence. We show that this representation allows an exponential saving, both in space and time. This interface has been implemented, and has been applied to the annotation of the Sanskrit Library corpus.

1 Generalities on Sanskrit linguistics

Sanskrit is the primary culture-vehicle language of India. It has had a continuous production of literature in all fields of human endeavour over the course of four millennia, giving rise to an immense corpus which is to this date only partially digitized. It benefits from a very sophisticated linguistic tradition stemming from the fairly complete grammar composed by Pāṇini by the fourth century B.C.E. During the last 15 years, a significant effort has been made to develop Sanskrit computational linguistics, and considerable progress has been achieved in providing computer assistance for Sanskrit corpus processing (Scharf and Hyman, 2009; Huet et al., 2009; Kulkarni and Huet, 2009; Jha, 2010; Kulkarni et al., 2010; Kumar et al., 2010; Kulkarni and Shukl, 2009; Goyal et al., 2009; Hellwig, 2009; Goyal et al., 2012). Nevertheless, there does not exist to date a complete analyser for Classical Sanskrit texts able to reliably compute morphological taggings in a completely automatic way. The main difficulty concerns segmentation, since Sanskrit is represented in writing as a continuous phonetic enunciation, which demands complex processing to analyse it into separate word forms.
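To make the segmentation difficulty concrete, the following minimal Python sketch counts the segmentations of an unspaced string over a toy lexicon and compares that count with the number of distinct word occurrences aligned on the input, in the spirit of the shared representation described in the abstract. It is an illustration under strong assumptions, not the implementation described here: sandhi is ignored entirely, and the lexicon, the input string, and the function names `count_segmentations` and `shared_segments` are hypothetical.

```python
from functools import lru_cache


def count_segmentations(text, lexicon):
    """Count the ways to split `text` into words of `lexicon` (no sandhi)."""
    @lru_cache(maxsize=None)
    def ways(i):
        if i == len(text):
            return 1
        return sum(ways(i + len(w)) for w in lexicon if text.startswith(w, i))
    return ways(0)


def shared_segments(text, lexicon):
    """List the (start, word) pairs used by at least one full segmentation.

    Each pair is stored once, aligned on its position in the input,
    no matter how many segmentations share it.
    """
    n = len(text)
    # fwd[i]: text[:i] can be fully segmented.
    fwd = [False] * (n + 1)
    fwd[0] = True
    for i in range(n):
        if fwd[i]:
            for w in lexicon:
                if text.startswith(w, i):
                    fwd[i + len(w)] = True
    # bwd[i]: text[i:] can be fully segmented.
    bwd = [False] * (n + 1)
    bwd[n] = True
    for i in range(n - 1, -1, -1):
        bwd[i] = any(text.startswith(w, i) and bwd[i + len(w)] for w in lexicon)
    return [
        (i, w)
        for i in range(n)
        if fwd[i]
        for w in sorted(lexicon)
        if text.startswith(w, i) and bwd[i + len(w)]
    ]


# Toy lexicon (assumption): two non-empty "words" that overlap heavily.
toy_lexicon = {"a", "aa"}
toy_text = "a" * 10

print(count_segmentations(toy_text, toy_lexicon))   # 89
print(len(shared_segments(toy_text, toy_lexicon)))  # 19
```

On this toy input the number of segmentations grows like the Fibonacci numbers with the length of the input, while the number of aligned segments grows only linearly, which is the kind of gap a shared representation exploits.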
Although complete algorithms for this segmentation preprocessing have been proposed (Huet, 2005), human assistance is still needed to focus on the intended solution among all possible analyses. We propose in this paper a new human-machine interface to help a professional annotator decide quickly between all possible segmentations in order to select a unique morphological analysis among the many possible ones. Indeed, there exist thousands of such segmentations for simple sentences, and literally billions for complex ones. Once a sufficient amount of tagged corpus is available through such semi-automated annotation tools, it is hoped that it can be used to train a fully automated parser using statistical methods.

2 Segmentation analysis

We are going to formalize the segmentation problem at various levels of abstraction. Firstly, we assume that Sanskrit text is represented as a list of phonemes. Sanskrit may be written in all Indian scripts, most usually in the Devanāgarī script used by languages of North India such as Hindi, but such a syllabic representation is awkward for morpho-phonetic computations, which operate at the phoneme level. It is thus preferable to translate the input into a list of phonemes, such translation being one-one. We assume the standard set of 50 phonemes, already known from the time of Pāṇini. Such low-level representation issues are discussed at length in (Scharf and Hyman, 2009;
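The phoneme-list representation assumed in Section 2 can be sketched as follows. This is a hedged illustration only: it reads romanized (IAST) input rather than Devanāgarī, the inventory below is a hypothetical subset written for this example and is not the 50-phoneme set referred to in the text, and the function name `to_phonemes` is an assumption. Greedy longest-match reading of digraphs such as `kh` or `ai` is also a simplification, so this romanized mapping is not guaranteed to be one-one in every case.

```python
import unicodedata

# Hypothetical IAST inventory for illustration (not the paper's encoding).
VOWELS = ["a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "ai", "o", "au"]
MARKS = ["ṃ", "ḥ"]  # anusvāra and visarga
CONSONANTS = [
    "k", "kh", "g", "gh", "ṅ",
    "c", "ch", "j", "jh", "ñ",
    "ṭ", "ṭh", "ḍ", "ḍh", "ṇ",
    "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m",
    "y", "r", "l", "v",
    "ś", "ṣ", "s", "h",
]

# Normalize to composed characters and try two-letter phonemes first.
PHONEMES = sorted(
    {unicodedata.normalize("NFC", p) for p in VOWELS + MARKS + CONSONANTS},
    key=len,
    reverse=True,
)


def to_phonemes(text):
    """Split a romanized word into a list of phonemes (greedy longest match)."""
    text = unicodedata.normalize("NFC", text)
    result = []
    i = 0
    while i < len(text):
        for p in PHONEMES:
            if text.startswith(p, i):
                result.append(p)
                i += len(p)
                break
        else:
            raise ValueError(f"unknown symbol {text[i]!r} at position {i}")
    return result


print(to_phonemes("dharma"))  # ['dh', 'a', 'r', 'm', 'a']
```

Working on such a list rather than on syllabic script units lets later steps compare and combine individual phonemes directly, which is what phoneme-level rules such as sandhi require.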
Similar resources
Annotating Sanskrit Corpus: Adapting IL-POSTS
In this paper we present an experiment on the use of the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al 2008 a&b), developed by Microsoft Research India (MSRI) for tagging Indian languages, for annotating a Sanskrit corpus. Sanskrit is a language with richer morphology and relatively free word order. The authors have included and excluded certain tags according to the require...
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit
SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text. The tagger tokenises text with a Markov model and performs part-of-speech tagging with a Hidden Markov model. Parameters for these processes are estimated from a manually annotated corpus of currently about 1,500,000 words. The article sketches the tagging process, reports the results of tagging a few short passages of Sans...
An Effort to Develop a Tagged Lexical Resource for Sanskrit
In this paper we present our efforts, the first of their kind in the history of Sanskrit, to design and develop a structured electronic lexical resource by tagging a traditional Sanskrit dictionary. We narrate how the whole unstructured raw text of Vaacaspatyam, an encyclopedic Sanskrit dictionary, has been tagged to form a user-friendly e-lexicon with structured and segregated informa...
Design & Analysis of an Exhaustive Algorithm for Sandhi Processing In Sanskrit
It is almost impossible to learn a new language without studying its grammar. Automated language processing is central to enabling easier referencing of the increasingly available Sanskrit e-texts. For learning the Sanskrit language, the study of its grammar plays a very important role. The proposed research paper presents a fresh approach to processing Sandhi...